NOTE: if you are not sure why the following exercises are useful or relevant to the multiple regression session then bear with me to the end; it will become clearer!

Sequences and designs for sampling and allocation

One trick you may find useful later in the course is making sequences of numbers.

There are a few ways to do this, but the simplest is to write: 1:10. That is, the number to start from (1), a colon (:), and then the number to end with (10).

Copy and paste these examples to see the output:

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
20:30
##  [1] 20 21 22 23 24 25 26 27 28 29 30

Explanation: The output shows that R has created a sequence of whole numbers between the start and finish number.
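The colon also counts downwards when the first number is larger than the second, which can be handy for reversing an order:

```r
# when the start is larger than the end, R counts down
10:1
# [1] 10  9  8  7  6  5  4  3  2  1
```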

To get a sequence with only even numbers, we can use the seq function, and set the by argument to 2:

seq(from=2, to=10, by=2)
## [1]  2  4  6  8 10

You can set by to any number, including a decimal:

seq(0, 27, by=3)
##  [1]  0  3  6  9 12 15 18 21 24 27
seq(0, 1, by=0.2)
## [1] 0.0 0.2 0.4 0.6 0.8 1.0

If your sequence doesn’t have a simple pattern, you can write out the numbers by hand using the c() function. For example:

c(1,40,92,188)
## [1]   1  40  92 188

Explanation: c(...) is short for combine, so this command combines the numbers 1, 40, 92, 188 into a new sequence. This is sometimes called a vector in R-speak.
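A small extra point (not essential, but often useful): c() is not limited to single numbers. It can also splice whole sequences together into one vector:

```r
# combine two ranges into a single vector
c(1:3, 10:12)
# [1]  1  2  3 10 11 12
```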

Make some sequences which include:

  • Even numbers from 10 to 20
  • Numbers in the 8 times table less than 200
  • 20 evenly spaced numbers between zero and 1 (including zero and 1)
  • The words “Wibble”, “Wobble” and “Bobble”

We can use seq for numbers:

seq(10,20,by=2)
## [1] 10 12 14 16 18 20
seq(8, 192, by=8)
##  [1]   8  16  24  32  40  48  56  64  72  80  88  96 104 112 120 128 136 144 152
## [20] 160 168 176 184 192
seq(0,1, by=1/19)
##  [1] 0.00000000 0.05263158 0.10526316 0.15789474 0.21052632 0.26315789
##  [7] 0.31578947 0.36842105 0.42105263 0.47368421 0.52631579 0.57894737
## [13] 0.63157895 0.68421053 0.73684211 0.78947368 0.84210526 0.89473684
## [19] 0.94736842 1.00000000
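If working out the step size by hand (the 1/19 above) feels fiddly, seq also has a length.out argument, which asks for a fixed number of evenly spaced values instead:

```r
# ask seq for exactly 20 evenly spaced values between 0 and 1
# (identical to seq(0, 1, by=1/19) above)
seq(0, 1, length.out = 20)
```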

But we need to use c() for lists of words:

c("Wibble", "Wobble", "Bobble")
## [1] "Wibble" "Wobble" "Bobble"

Combinations of sequences

In designing experiments, or creating a grid of numbers for making predictions, we often want to create combinations of different categories which represent conditions or stimuli.

Imagine a hypothetical study with a test phase where participants are presented with multiple words, in either red or green text, and shown at either the bottom or top of the computer screen.

The combinations look something like this:

condition  colour  position  word
1          Red     Top       Nobble
2          Green   Top       Nobble
3          Red     Bottom    Nobble
4          Green   Bottom    Nobble
5          Red     Top       Wobble
6          Green   Top       Wobble
7          Red     Bottom    Wobble
8          Green   Bottom    Wobble
9          Red     Top       Hobble
10         Green   Top       Hobble
11         Red     Bottom    Hobble
12         Green   Bottom    Hobble

R provides quick ways of creating combinations of variables, using a command called expand.grid.

First, we need to create a sequence of each of the possible values for our categories:

colours = c("Red", "Green")
positions = c("Top", "Bottom")
words = c("Nobble", "Wobble", "Hobble")

Then we can use expand.grid to give us all the possible combinations of these:

expand.grid(colour=colours, position=positions, words = words)
##    colour position  words
## 1     Red      Top Nobble
## 2   Green      Top Nobble
## 3     Red   Bottom Nobble
## 4   Green   Bottom Nobble
## 5     Red      Top Wobble
## 6   Green      Top Wobble
## 7     Red   Bottom Wobble
## 8   Green   Bottom Wobble
## 9     Red      Top Hobble
## 10  Green      Top Hobble
## 11    Red   Bottom Hobble
## 12  Green   Bottom Hobble

Explanation: The expand.grid function has taken the items in the three input sequences (colours, positions and words) and created a dataframe which contains all the possible combinations. We could save these to a file if we wanted to use them as part of our experiment.
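A quick sanity check: because every combination appears exactly once, the number of rows should be the product of the lengths of the inputs (2 × 2 × 3 = 12). We can confirm this with nrow:

```r
colours <- c("Red", "Green")
positions <- c("Top", "Bottom")
words <- c("Nobble", "Wobble", "Hobble")

# all combinations of the three inputs
design <- expand.grid(colour = colours, position = positions, words = words)

# 2 colours x 2 positions x 3 words = 12 conditions
nrow(design)
# [1] 12
```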

Create some experimental designs of your own

  1. Reproduce the experiment design above by copying and pasting

  2. Adapt the commands to allow for an experiment where the word position could be either top, bottom, left or right. How many different conditions would there be in this case?
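One possible answer to the second task (try it yourself before reading on — this is just a sketch): adding Left and Right to the positions doubles the number of conditions, from 12 to 24:

```r
colours <- c("Red", "Green")
positions <- c("Top", "Bottom", "Left", "Right")
words <- c("Nobble", "Wobble", "Hobble")

design <- expand.grid(colour = colours, position = positions, word = words)
nrow(design)  # 2 x 4 x 3 = 24 conditions
```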

As a stretch task (this might be slightly harder): how would you create a design where a sequence of 3 words is presented? Each word must be different, but each 3-word combination can be shown in red or green text, and at the top or bottom of the screen.

library(tidyverse)  # loads dplyr, which provides %>% and filter

expand.grid(colour=colours, position=positions, word1 = words, word2=words, word3=words) %>% 
  filter(
    word1 != word2 & word2 != word3 & word1 != word3
  ) 
##    colour position  word1  word2  word3
## 1     Red      Top Hobble Wobble Nobble
## 2   Green      Top Hobble Wobble Nobble
## 3     Red   Bottom Hobble Wobble Nobble
## 4   Green   Bottom Hobble Wobble Nobble
## 5     Red      Top Wobble Hobble Nobble
## 6   Green      Top Wobble Hobble Nobble
## 7     Red   Bottom Wobble Hobble Nobble
## 8   Green   Bottom Wobble Hobble Nobble
## 9     Red      Top Hobble Nobble Wobble
## 10  Green      Top Hobble Nobble Wobble
## 11    Red   Bottom Hobble Nobble Wobble
## 12  Green   Bottom Hobble Nobble Wobble
## 13    Red      Top Nobble Hobble Wobble
## 14  Green      Top Nobble Hobble Wobble
## 15    Red   Bottom Nobble Hobble Wobble
## 16  Green   Bottom Nobble Hobble Wobble
## 17    Red      Top Wobble Nobble Hobble
## 18  Green      Top Wobble Nobble Hobble
## 19    Red   Bottom Wobble Nobble Hobble
## 20  Green   Bottom Wobble Nobble Hobble
## 21    Red      Top Nobble Wobble Hobble
## 22  Green      Top Nobble Wobble Hobble
## 23    Red   Bottom Nobble Wobble Hobble
## 24  Green   Bottom Nobble Wobble Hobble

Explanation of the command: There may be a neater way of doing this, but here we simply:

  • Create all combinations of the colours, positions and the three word slots (word1, word2 and word3)
  • Filter out rows where any two of the words are the same

Another way to do the same thing using the unique and length functions might be:

expand.grid(colour=colours, position=positions, word1 = words, word2=words, word3=words) %>% 
  rowwise() %>% 
  filter(length(unique(c(word1, word2, word3)))==3)
## # A tibble: 24 x 5
## # Rowwise: 
##    colour position word1  word2  word3 
##    <fct>  <fct>    <fct>  <fct>  <fct> 
##  1 Red    Top      Hobble Wobble Nobble
##  2 Green  Top      Hobble Wobble Nobble
##  3 Red    Bottom   Hobble Wobble Nobble
##  4 Green  Bottom   Hobble Wobble Nobble
##  5 Red    Top      Wobble Hobble Nobble
##  6 Green  Top      Wobble Hobble Nobble
##  7 Red    Bottom   Wobble Hobble Nobble
##  8 Green  Bottom   Wobble Hobble Nobble
##  9 Red    Top      Hobble Nobble Wobble
## 10 Green  Top      Hobble Nobble Wobble
## # … with 14 more rows

Explanation: Here we apply the same filter, but use the length function to count the number of unique words among word1, word2 and word3. The rowwise function makes R consider each row individually (without it, the filter won’t work, because R tries to consider a whole column at once).
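dplyr also provides n_distinct, which is shorthand for length(unique(...)), so an equivalent filter can be written a little more compactly (a minor variation, not from the worksheet):

```r
library(dplyr)

colours <- c("Red", "Green")
positions <- c("Top", "Bottom")
words <- c("Nobble", "Wobble", "Hobble")

design <- expand.grid(colour = colours, position = positions,
                      word1 = words, word2 = words, word3 = words,
                      stringsAsFactors = FALSE) %>%
  rowwise() %>%
  # keep only rows where all three words differ
  filter(n_distinct(c(word1, word2, word3)) == 3) %>%
  ungroup()

nrow(design)  # 24 rows, as before
```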

Take the model from the main worksheet where we predicted grades from work hours for men and women.

In the main worksheet we created a dataframe by hand to tell augment what predictions we wanted.

Now try using expand.grid to make the new dataframe instead. For example, try making predictions for men and women who work 20, 25, 30, 35, or 40 hours per week. Make this dataframe using expand.grid and seq, rather than writing out every combination by hand with c().

# setup and read data
library(tidyverse)
studyhabits <- read_csv('https://benwhalley.github.io/rmip/data/studyhabitsandgrades.csv')

# run the model
second.model <- lm(grade ~ work_hours * female, data = studyhabits)

# make the new data grid for predictions
prediction.grid <- expand.grid(work_hours=seq(20,40,5), female=c(TRUE, FALSE)) %>% 
  as_tibble
prediction.grid
## # A tibble: 10 x 2
##    work_hours female
##         <dbl> <lgl> 
##  1         20 TRUE  
##  2         25 TRUE  
##  3         30 TRUE  
##  4         35 TRUE  
##  5         40 TRUE  
##  6         20 FALSE 
##  7         25 FALSE 
##  8         30 FALSE 
##  9         35 FALSE 
## 10         40 FALSE
# make the predictions
new.predictions <- broom::augment(second.model, newdata=prediction.grid)
new.predictions
## # A tibble: 10 x 4
##    work_hours female .fitted .se.fit
##         <dbl> <lgl>    <dbl>   <dbl>
##  1         20 TRUE      55.5   0.821
##  2         25 TRUE      59.1   0.542
##  3         30 TRUE      62.7   0.740
##  4         35 TRUE      66.3   1.20 
##  5         40 TRUE      69.9   1.72 
##  6         20 FALSE     54.5   1.29 
##  7         25 FALSE     63.5   1.46 
##  8         30 FALSE     72.5   2.55 
##  9         35 FALSE     81.4   3.84 
## 10         40 FALSE     90.4   5.19
# plot predictions from the model with SE
new.predictions %>% 
  ggplot(aes(work_hours, 
             y=.fitted, 
             ymin= .fitted-.se.fit, 
             ymax= .fitted+.se.fit, 
             color=female, group=female)) +
  geom_line() +
  geom_errorbar(width=.1) +
  geom_point()

# geom_ribbon is another way of showing the error of estimates
new.predictions %>% 
  ggplot(aes(work_hours, 
             y=.fitted, 
             ymin= .fitted-.se.fit, 
             ymax= .fitted+.se.fit, 
             color=female, group=female)) +
  geom_ribbon(alpha=.1, color=NA) +
  geom_line()

Clinical trial example

Data from a clinical trial of Functional Imagery Training (FIT; Solbrig et al. 2019) are available at https://zenodo.org/record/1120364/files/blind_data.csv. In this file, group records the treatment group (1 = Motivational Interviewing; 2 = FIT). The kg1 and kg3 variables record each patient’s weight in kilograms before and after treatment, respectively. Load these data and complete the following tasks:

  1. Plot the difference in weight between treatment groups at follow-up (kg3)

  2. Create a plot to show whether men or women benefited most from the treatment (this will require some thinking about what goes on the y-axis, and perhaps some pre-processing of the data).

  3. Create a plot to show whether older participants benefited more or less than younger ones (again this will require some thinking, and there are quite a number of different plot types which could be used, each with different pros and cons).

  4. Run a linear model which is equivalent to the plot you created. Can you match the coefficients in the model output to the points and lines on your graph?
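If you are unsure where to start, here is one possible sketch for loading and labelling the data, with a first plot for task 1 (it assumes the column names described above; a boxplot is just one reasonable choice):

```r
library(tidyverse)

# load the trial data and label the groups
# (group: 1 = Motivational Interviewing, 2 = FIT)
trial <- read_csv("https://zenodo.org/record/1120364/files/blind_data.csv") %>%
  mutate(group = factor(group, levels = c(1, 2), labels = c("MI", "FIT")))

# task 1: weight at follow-up (kg3) by treatment group
trial %>%
  ggplot(aes(group, kg3)) + geom_boxplot()
```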

Preview: Multiple predictors

In the previous example plotting our data was quite simple: we had an x and y axis, and used colour to show a third variable.

If we have two continuous predictors we can do the same, using colour or size to indicate the third dimension. Here are two examples using some data on the income, murder and illiteracy rates of US states:

# the data called state.x77 are in an old fashioned format, 
# so do this first to make them into a dataframe
states <- as_tibble(state.x77)
states %>% 
  ggplot(aes(x = Murder, y = Income, color=Illiteracy)) + geom_point()
3 dimensions using x, y and colour

states %>% 
  ggplot(aes(x = Murder, y = Income, size=Illiteracy)) + geom_point()
3 dimensions using x, y and point size

Another alternative is to categorize one of the variables and make a faceted plot:

states %>% 
  mutate(Illiteracy_categorised = cut(Illiteracy, 3)) %>% 
  ggplot(aes(Murder, Income)) + geom_point() +
  facet_wrap(~Illiteracy_categorised, labeller = label_both) +
  geom_smooth(method=lm, se=F)


If you look at these plots it’s just about possible to see that there is an interaction here: As Illiteracy increases, the relationship between Murder and Income changes (it starts positive but ends up slightly negative).

Try describing the plot above in words.

Try re-plotting the data with different variables on different axes (e.g. swap x, y, colour, size, etc.).

Which feels most ‘natural’ to you, and easiest to describe?

3d plots

However, another way of displaying these data is to use a true 3d plot. An example is below:

Regression ‘surfaces’

In the previous session we saw that we can represent regression coefficients in a 2d plot. For example:

states %>% 
  ggplot(aes(Income, Illiteracy)) + geom_point() + geom_smooth(method="lm", se=F)

This regression line looks problematic. The residuals (gaps from point to line) are quite large.

But, if we try to add Murder as another predictor of Income we can’t represent the model as a simple line. Instead we need to think of the regression model as a 3d surface:

Play with the interactive plot above.

Is the 3d surface a good ‘fit’ to these data?


In future sessions, as we increase the number of variables in our models, it will be difficult to keep hold of simple graphical equivalents to our models because it’s hard to think in 5 or 6 or more dimensions.

It’s good to remember this type of plot though. As we add more variables the lm function is still trying to minimise the gaps between the predicted line or surface and the actual data.
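To make this concrete, here is a small illustration using the states data from above (not the worksheet model): lm picks the coefficients that minimise the sum of squared residuals, a quantity we can compute directly:

```r
states <- as.data.frame(state.x77)

# a model with two predictors: geometrically, a 3d 'surface'
m <- lm(Income ~ Illiteracy + Murder, data = states)

# the quantity lm minimises: the sum of squared gaps
# between the fitted surface and the observed data
sum(resid(m)^2)
```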

Solbrig, Linda, Ben Whalley, David J Kavanagh, Jon May, Tracey Parkin, Ray Jones, and Jackie Andrade. 2019. “Functional Imagery Training Versus Motivational Interviewing for Weight Loss: A Randomised Controlled Trial of Brief Individual Interventions for Overweight and Obesity.” International Journal of Obesity 43 (4): 883. https://www.ncbi.nlm.nih.gov/pubmed/30185920.









All content on this site distributed under a Creative Commons licence. CC-BY-SA 4.0.